Analysis of Uber Data in New York City

Team member:

Abdulrahman Alsalamah

Problem statement

In New York City, many people use Uber or a taxi to get around the city and commute to work. Companies need to understand how people move: the busiest times of day and the number of orders per day. Knowing these traffic patterns lets the company increase its profits and serve customers better and on time.

In [1]:
# import required library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import geopandas


sns.set_style('whitegrid')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

EDA

In [2]:
# read csv file 

uber = pd.read_csv('uber-pickups-in-new-york-city/uber-raw-data-apr14.csv')
In [3]:
# see top 5
uber.head()
Out[3]:
Date/Time Lat Lon Base
0 4/1/2014 0:11:00 40.7690 -73.9549 B02512
1 4/1/2014 0:17:00 40.7267 -74.0345 B02512
2 4/1/2014 0:21:00 40.7316 -73.9873 B02512
3 4/1/2014 0:28:00 40.7588 -73.9776 B02512
4 4/1/2014 0:33:00 40.7594 -73.9722 B02512
In [4]:
# there are 564516 rows and 4 columns
uber.shape
Out[4]:
(564516, 4)
In [5]:
# info of data
uber.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564516 entries, 0 to 564515
Data columns (total 4 columns):
Date/Time    564516 non-null object
Lat          564516 non-null float64
Lon          564516 non-null float64
Base         564516 non-null object
dtypes: float64(2), object(2)
memory usage: 17.2+ MB
In [6]:
# count null values in the data
uber.isnull().sum()
Out[6]:
Date/Time    0
Lat          0
Lon          0
Base         0
dtype: int64
In [7]:
# plot a heatmap to spot null values
plt.figure(figsize=(9,9))
sns.heatmap(uber.isnull(), annot=False,cbar=False,yticklabels=False,cmap='viridis')
plt.show()
In [8]:
#convert date/time to timestamp
uber['Date/Time'] = pd.to_datetime(uber['Date/Time'] ,format='%m/%d/%Y %H:%M:%S')
In [9]:
uber.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564516 entries, 0 to 564515
Data columns (total 4 columns):
Date/Time    564516 non-null datetime64[ns]
Lat          564516 non-null float64
Lon          564516 non-null float64
Base         564516 non-null object
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 17.2+ MB
In [10]:
# change column name
uber = uber.rename(columns={'Date/Time':'date_time','Lat':'lat', 'Lon':'lon','Base':'base'})
In [11]:
uber.head()
Out[11]:
date_time lat lon base
0 2014-04-01 00:11:00 40.7690 -73.9549 B02512
1 2014-04-01 00:17:00 40.7267 -74.0345 B02512
2 2014-04-01 00:21:00 40.7316 -73.9873 B02512
3 2014-04-01 00:28:00 40.7588 -73.9776 B02512
4 2014-04-01 00:33:00 40.7594 -73.9722 B02512
In [12]:
# print the date info
print('timestamp',uber.date_time[0])
print('day',uber.date_time[0].day)
print('month',uber.date_time[0].month)
print('year',uber.date_time[0].year)
print('second',uber.date_time[0].second)
print('minute',uber.date_time[0].minute)
print('hour',uber.date_time[0].hour)
print('day of week',uber.date_time[0].dayofweek)
timestamp 2014-04-01 00:11:00
day 1
month 4
year 2014
second 0
minute 11
hour 0
day of week 1
In [13]:
# add new columns

uber['day_of_month'] = uber.date_time.dt.day
uber['month'] = uber.date_time.dt.month
uber['year']  = uber.date_time.dt.year
uber['second'] = uber.date_time.dt.second
uber['minute'] = uber.date_time.dt.minute
uber['hour'] = uber.date_time.dt.hour
uber['day_of_week'] = uber.date_time.dt.dayofweek
uber['name_of_day'] = uber.date_time.dt.day_name()  # weekday_name was removed in pandas 1.0
In [14]:
uber.head(10)
Out[14]:
date_time lat lon base day_of_month month year second minute hour day_of_week name_of_day
0 2014-04-01 00:11:00 40.7690 -73.9549 B02512 1 4 2014 0 11 0 1 Tuesday
1 2014-04-01 00:17:00 40.7267 -74.0345 B02512 1 4 2014 0 17 0 1 Tuesday
2 2014-04-01 00:21:00 40.7316 -73.9873 B02512 1 4 2014 0 21 0 1 Tuesday
3 2014-04-01 00:28:00 40.7588 -73.9776 B02512 1 4 2014 0 28 0 1 Tuesday
4 2014-04-01 00:33:00 40.7594 -73.9722 B02512 1 4 2014 0 33 0 1 Tuesday
5 2014-04-01 00:33:00 40.7383 -74.0403 B02512 1 4 2014 0 33 0 1 Tuesday
6 2014-04-01 00:39:00 40.7223 -73.9887 B02512 1 4 2014 0 39 0 1 Tuesday
7 2014-04-01 00:45:00 40.7620 -73.9790 B02512 1 4 2014 0 45 0 1 Tuesday
8 2014-04-01 00:55:00 40.7524 -73.9960 B02512 1 4 2014 0 55 0 1 Tuesday
9 2014-04-01 01:01:00 40.7575 -73.9846 B02512 1 4 2014 0 1 1 1 Tuesday

Visualization

In [33]:
# plot pickups by day of week

uber_weekdays = uber.pivot_table(index=['name_of_day','day_of_week'],
                                  values='base',
                                  aggfunc='count').sort_values('day_of_week')

uber_weekdays.plot(kind='bar', figsize=(14,10),legend=False)
plt.ylabel('Number of pickups')
plt.title('Uber Pickups by Day of Week',size = 25);

Note

Looking at the data, we note that the days in the middle of the week see the most requests, while orders decrease at the weekend.
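The midweek-vs-weekend pattern read off the bar chart can also be confirmed numerically with `value_counts()`. A minimal sketch on a toy series standing in for `uber['name_of_day']` (real counts come from the full frame):

```python
import pandas as pd

# Toy stand-in for uber['name_of_day']
days = pd.Series(['Wednesday', 'Wednesday', 'Wednesday', 'Tuesday',
                  'Tuesday', 'Saturday', 'Sunday'])

# value_counts() sorts descending, so the busiest day comes first
day_counts = days.value_counts()
busiest_day = day_counts.index[0]
```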

In [34]:
# box plot of pickup hours for each day of the week
plt.figure(figsize=(16,12))
plt.title('Pick up in hours',size=25)
sns.boxplot(x='name_of_day',y='hour',data=uber, palette=['deepskyblue','palegreen','sandybrown'],saturation=.6)
plt.xlabel('Name of day',size=20)
plt.ylabel('Hour',size=20)
plt.show()

Note

In this graph, we notice that most orders are placed between 10:00 and 18:00.

In [35]:
# plot traffic by day of the month
sns.set(style="darkgrid")
plt.figure(figsize=(15,8))
plt.title('Traffic by day of month',size=25)
ax = sns.countplot(x="day_of_month", data=uber,color='blue')

Note

Looking at the data across the month, we notice a steady growth in the number of Uber pickups.

In [36]:
# plot traffic by hour of the day
sns.set(style="darkgrid")
plt.figure(figsize=(15,8))
plt.title('Traffic by hour of day',size=25)
ax = sns.countplot(x="hour", data=uber,color='blue')

Note

In this graph, we notice that most orders are placed between 14:00 and 22:00.

In [37]:
# preview the frame with date_time as the index (not assigned back)
uber.set_index('date_time')
Out[37]:
lat lon base day_of_month month year second minute hour day_of_week name_of_day
date_time
2014-04-01 00:11:00 40.7690 -73.9549 B02512 1 4 2014 0 11 0 1 Tuesday
2014-04-01 00:17:00 40.7267 -74.0345 B02512 1 4 2014 0 17 0 1 Tuesday
2014-04-01 00:21:00 40.7316 -73.9873 B02512 1 4 2014 0 21 0 1 Tuesday
2014-04-01 00:28:00 40.7588 -73.9776 B02512 1 4 2014 0 28 0 1 Tuesday
2014-04-01 00:33:00 40.7594 -73.9722 B02512 1 4 2014 0 33 0 1 Tuesday
... ... ... ... ... ... ... ... ... ... ... ...
2014-04-30 23:22:00 40.7640 -73.9744 B02764 30 4 2014 0 22 23 2 Wednesday
2014-04-30 23:26:00 40.7629 -73.9672 B02764 30 4 2014 0 26 23 2 Wednesday
2014-04-30 23:31:00 40.7443 -73.9889 B02764 30 4 2014 0 31 23 2 Wednesday
2014-04-30 23:32:00 40.6756 -73.9405 B02764 30 4 2014 0 32 23 2 Wednesday
2014-04-30 23:48:00 40.6880 -73.9608 B02764 30 4 2014 0 48 23 2 Wednesday

564516 rows × 11 columns

In [39]:
import plotly.express as px

fig = px.scatter_mapbox(uber, lat="lat", lon="lon", hover_name="date_time",
                        color_discrete_sequence=["fuchsia"], zoom=8, height=600)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Note

In this plot, we note that most requests come from Manhattan and the city center, with only scattered requests on the outskirts of the city.
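The Manhattan concentration seen on the map can also be quantified by counting pickups inside a bounding box. A sketch, assuming a rough, unofficial Manhattan box and a tiny synthetic sample in place of the full frame:

```python
import pandas as pd

# Rough Manhattan bounding box -- assumed values for illustration only
LAT_MIN, LAT_MAX = 40.70, 40.88
LON_MIN, LON_MAX = -74.02, -73.93

# Tiny synthetic stand-in for uber[['lat', 'lon']]
pts = pd.DataFrame({'lat': [40.769, 40.727, 40.600],
                    'lon': [-73.955, -74.035, -73.800]})

# Boolean mask of points inside the box, then the share of the total
in_box = (pts['lat'].between(LAT_MIN, LAT_MAX)
          & pts['lon'].between(LON_MIN, LON_MAX))
manhattan_share = in_box.mean()
```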

In [ ]:
# another map style you can try (column names updated to match the rename above)

# fig = px.scatter_mapbox(uber, lat="lat", lon="lon", hover_name="date_time",
#                         color_discrete_sequence=["fuchsia"], zoom=8, height=600)
# fig.update_layout(mapbox_style="carto-darkmatter")
# fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# fig.show()
In [41]:
uber.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564516 entries, 0 to 564515
Data columns (total 12 columns):
date_time       564516 non-null datetime64[ns]
lat             564516 non-null float64
lon             564516 non-null float64
base            564516 non-null object
day_of_month    564516 non-null int64
month           564516 non-null int64
year            564516 non-null int64
second          564516 non-null int64
minute          564516 non-null int64
hour            564516 non-null int64
day_of_week     564516 non-null int64
name_of_day     564516 non-null object
dtypes: datetime64[ns](1), float64(2), int64(7), object(2)
memory usage: 51.7+ MB
In [42]:
# clip coordinates to the NYC bounding box; assigning via df.loc avoids
# the chained-assignment SettingWithCopyWarning
uber.loc[uber['lon'] > -73.8, 'lon'] = -73.8
uber.loc[uber['lon'] < -74.1, 'lon'] = -74.1
uber.loc[uber['lat'] > 40.9, 'lat'] = 40.9
uber.loc[uber['lat'] < 40.6, 'lat'] = 40.6

# another plot 

plt.figure(figsize=(16,16))
sns.set_style("darkgrid", {'axes.grid' : False})
plt.scatter(np.array(uber['lon']), np.array(uber['lat']), s=0.2, alpha=0.6)
plt.xlim([-74.1,-73.8])
plt.ylim([40.6, 40.9])
plt.show()
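The four threshold assignments above can also be written with pandas' `clip`, which bounds a Series in one call per column; a sketch on synthetic coordinates:

```python
import pandas as pd

# Synthetic stand-in for the uber coordinate columns
df = pd.DataFrame({'lat': [40.95, 40.75, 40.50],
                   'lon': [-73.70, -73.98, -74.20]})

# clip() caps values outside [lower, upper] at the bounds
df['lat'] = df['lat'].clip(lower=40.6, upper=40.9)
df['lon'] = df['lon'].clip(lower=-74.1, upper=-73.8)
```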

In [ ]:
# from keplergl import KeplerGl 
# map_1 = KeplerGl(height=500)
# map_1.add_data(data=uber[['date_time','lat','lon']], name='data_1')
# map_1
In [43]:
# plot with kepler.gl, Uber's geospatial visualization framework
from keplergl import KeplerGl 
map_1 = KeplerGl(height=800,zoom = 7.5)
map_1.add_data(data=uber[['date_time','lat','lon']], name='data_1')

map_1
User Guide: https://github.com/keplergl/kepler.gl/blob/master/docs/keplergl-jupyter/user-guide.md
In [44]:
uber.head()
Out[44]:
date_time lat lon base day_of_month month year second minute hour day_of_week name_of_day
0 2014-04-01 00:11:00 40.7690 -73.9549 B02512 1 4 2014 0 11 0 1 Tuesday
1 2014-04-01 00:17:00 40.7267 -74.0345 B02512 1 4 2014 0 17 0 1 Tuesday
2 2014-04-01 00:21:00 40.7316 -73.9873 B02512 1 4 2014 0 21 0 1 Tuesday
3 2014-04-01 00:28:00 40.7588 -73.9776 B02512 1 4 2014 0 28 0 1 Tuesday
4 2014-04-01 00:33:00 40.7594 -73.9722 B02512 1 4 2014 0 33 0 1 Tuesday
In [54]:
# build a day-by-hour table for the heatmap; 'month' is constant (April = 4),
# so summing it gives 4x the pickup count per (day, hour) cell
new_data = uber.groupby(['name_of_day','hour']).sum()['month']
In [55]:
new_data.reset_index()
Out[55]:
name_of_day hour month
0 Friday 0 5468
1 Friday 1 3040
2 Friday 2 2052
3 Friday 3 2944
4 Friday 4 3728
... ... ... ...
163 Wednesday 19 31176
164 Wednesday 20 31132
165 Wednesday 21 27684
166 Wednesday 22 19380
167 Wednesday 23 10284

168 rows × 3 columns

In [56]:
new_data.head(10)
Out[56]:
name_of_day  hour
Friday       0        5468
             1        3040
             2        2052
             3        2944
             4        3728
             5        5528
             6       11344
             7       15772
             8       14592
             9       10928
Name: month, dtype: int64
In [57]:
new_data = pd.DataFrame(new_data)
In [58]:
new_data = new_data.reset_index()
In [59]:
new_data = new_data.pivot(index="hour", columns="name_of_day")
In [60]:
new_data
Out[60]:
month
name_of_day Friday Monday Saturday Sunday Thursday Tuesday Wednesday
hour
0 5468 2072 12108 18168 3168 3060 3596
1 3040 1044 9916 11744 1836 1468 2028
2 2052 952 6308 6360 1368 1216 1484
3 2944 2284 4052 4208 2268 2064 2340
4 3728 4084 2824 2740 3444 3548 4012
5 5528 6476 2816 2372 5816 6936 7960
6 11344 11896 3376 2676 12716 15064 16920
7 15772 15552 4440 3492 16636 21216 22588
8 14592 12552 5488 4932 14464 18376 20968
9 10928 8844 7056 7080 10616 11848 15384
10 10396 7812 8344 8452 9480 11600 15376
11 11264 7716 9260 9440 10064 11796 15556
12 11912 7780 10240 9912 10628 11276 15952
13 14140 9176 10740 11052 13204 14224 17876
14 16348 12468 12168 11736 16332 17956 21752
15 21416 15272 17828 13600 20728 24168 28284
16 25036 19848 21640 13956 24596 30084 32852
17 27160 22296 22232 12616 27804 33188 36604
18 29032 18900 24660 11180 26548 28356 33336
19 24988 17544 22116 10316 23716 25836 31176
20 20660 14292 19168 9104 25380 25240 31132
21 25060 12316 23244 9240 26340 23972 27684
22 26832 7904 25972 6556 21480 14456 19380
23 21572 4364 22876 4072 11636 7792 10284
In [61]:
# change the name of columns
new_data.columns = ['Friday','Monday','Saturday','Sunday','Thursday','Tuesday','Wednesday']
In [62]:
# reorder columns to start the week on Monday
new_data = new_data[['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']]
In [63]:
new_data.columns
Out[63]:
Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'],
      dtype='object')
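The groupby → DataFrame → reset_index → pivot → rename chain above can be collapsed into a single `size().unstack()` call, where `size()` counts rows directly instead of summing the constant month column. A sketch on a toy frame standing in for `uber[['name_of_day', 'hour']]`:

```python
import pandas as pd

# Toy stand-in for uber[['name_of_day', 'hour']]
df = pd.DataFrame({'name_of_day': ['Friday', 'Friday', 'Monday', 'Friday'],
                   'hour': [0, 0, 1, 1]})

# size() counts pickups per (day, hour); unstack() pivots days into columns
heat = (df.groupby(['name_of_day', 'hour'])
          .size()
          .unstack('name_of_day', fill_value=0))
```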
In [64]:
# plot heatmap
plt.figure(figsize=(14,14))
sns.heatmap(new_data)
plt.show()

Note

This heatmap shows the number of requests at each (day, hour) combination. On working days there is a morning peak as people travel to work and an evening peak as they leave work for home. At the weekend, we notice many requests at 00:00 and 01:00, as people head home late at night.
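The busiest and quietest (day, hour) cells described above can also be read off programmatically with `stack()` plus `idxmax()`/`idxmin()`. A sketch using a small frame shaped like `new_data`, with counts copied from the table above:

```python
import pandas as pd

# Small frame shaped like the day-by-hour pivot (values from the pivot table)
heat = pd.DataFrame({'Monday': [2072, 22296],
                     'Wednesday': [3596, 36604]},
                    index=pd.Index([0, 17], name='hour'))

# stack() flattens to (hour, day) pairs; idxmax()/idxmin() find the extremes
peak_hour, peak_day = heat.stack().idxmax()
quiet_hour, quiet_day = heat.stack().idxmin()
```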

In [65]:
uber.head()
Out[65]:
date_time lat lon base day_of_month month year second minute hour day_of_week name_of_day
0 2014-04-01 00:11:00 40.7690 -73.9549 B02512 1 4 2014 0 11 0 1 Tuesday
1 2014-04-01 00:17:00 40.7267 -74.0345 B02512 1 4 2014 0 17 0 1 Tuesday
2 2014-04-01 00:21:00 40.7316 -73.9873 B02512 1 4 2014 0 21 0 1 Tuesday
3 2014-04-01 00:28:00 40.7588 -73.9776 B02512 1 4 2014 0 28 0 1 Tuesday
4 2014-04-01 00:33:00 40.7594 -73.9722 B02512 1 4 2014 0 33 0 1 Tuesday

Time Series Analysis

In [73]:
# load additional months of pickup data
concat_uber = pd.read_csv('uber-pickups-in-new-york-city/uber-raw-data-apr14.csv')
In [74]:
concat_uber = pd.concat([concat_uber,pd.read_csv('uber-pickups-in-new-york-city/uber-raw-data-may14.csv')])
In [75]:
concat_uber = pd.concat([concat_uber,pd.read_csv('uber-pickups-in-new-york-city/uber-raw-data-jun14.csv')])
In [76]:
concat_uber = pd.concat([concat_uber,pd.read_csv('uber-pickups-in-new-york-city/uber-raw-data-jul14.csv')])
concat_uber = pd.concat([concat_uber,pd.read_csv('uber-pickups-in-new-york-city/uber-raw-data-aug14.csv')])
concat_uber = pd.concat([concat_uber,pd.read_csv('uber-pickups-in-new-york-city/uber-raw-data-sep14.csv')])
concat_uber.tail()
Out[76]:
Date/Time Lat Lon Base
1028131 9/30/2014 22:57:00 40.7668 -73.9845 B02764
1028132 9/30/2014 22:57:00 40.6911 -74.1773 B02764
1028133 9/30/2014 22:58:00 40.8519 -73.9319 B02764
1028134 9/30/2014 22:58:00 40.7081 -74.0066 B02764
1028135 9/30/2014 22:58:00 40.7140 -73.9496 B02764
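The six month-by-month `pd.read_csv` + `pd.concat` cells above can be collapsed into one list comprehension. A sketch where a hypothetical `load_month` stands in for reading each monthly CSV:

```python
import pandas as pd

def load_month(m):
    # hypothetical stand-in for
    # pd.read_csv(f'uber-pickups-in-new-york-city/uber-raw-data-{m}.csv')
    return pd.DataFrame({'Base': ['B02512'], 'month_tag': [m]})

months = ['apr14', 'may14', 'jun14', 'jul14', 'aug14', 'sep14']
concat_sketch = pd.concat([load_month(m) for m in months], ignore_index=True)
```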
In [77]:
concat_uber.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4534327 entries, 0 to 1028135
Data columns (total 4 columns):
Date/Time    object
Lat          float64
Lon          float64
Base         object
dtypes: float64(2), object(2)
memory usage: 173.0+ MB
In [78]:
# convert Date/Time to a timestamp
concat_uber['Date/Time'] = pd.to_datetime(concat_uber['Date/Time'] ,format='%m/%d/%Y %H:%M:%S')
In [79]:
concat_uber = concat_uber.set_index('Date/Time').resample('D').count()['Base']
In [81]:
# plot the data
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
concat_uber.plot(figsize=(15, 6))
plt.show()
In [82]:
from pylab import rcParams
import warnings
import itertools
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')

Some distinguishable patterns appear when we plot the data. The time series has a clear seasonality pattern: orders are low at the weekend and high on working days, and there is a strong upward trend within each week.

I will use a method called time-series decomposition, which allows us to decompose a time series into three distinct components: trend, seasonality, and noise.

In [84]:
import statsmodels.api as sm

rcParams['figure.figsize'] = 18, 8

decomposition = sm.tsa.seasonal_decompose(concat_uber, model='additive')
fig = decomposition.plot()
plt.show()

Time series forecasting with ARIMA

In [85]:
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)
In [86]:
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(concat_uber,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()

            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:
            continue
ARIMA(0, 0, 0)x(0, 0, 0, 12)12 - AIC:4216.144919686671
ARIMA(0, 0, 0)x(0, 0, 1, 12)12 - AIC:3844.5038910123367
ARIMA(0, 0, 0)x(0, 1, 0, 12)12 - AIC:3538.548825486305
ARIMA(0, 0, 0)x(0, 1, 1, 12)12 - AIC:3263.6643338667586
ARIMA(0, 0, 0)x(1, 0, 0, 12)12 - AIC:3559.9322390444513
ARIMA(0, 0, 0)x(1, 0, 1, 12)12 - AIC:3501.7740847979194
ARIMA(0, 0, 0)x(1, 1, 0, 12)12 - AIC:3302.868742127457
ARIMA(0, 0, 0)x(1, 1, 1, 12)12 - AIC:3264.856454309691
ARIMA(0, 0, 1)x(0, 0, 0, 12)12 - AIC:4064.1561949320076
ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:3769.894422337914
ARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:3432.5393570445044
ARIMA(0, 0, 1)x(0, 1, 1, 12)12 - AIC:3158.693876389045
ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:3798.3882583845316
ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:3753.692375966894
ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:3217.50538142152
ARIMA(0, 0, 1)x(1, 1, 1, 12)12 - AIC:3160.405612678299
ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:3573.6308389764135
ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:3323.934972023236
ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:3482.0032247192034
ARIMA(0, 1, 0)x(0, 1, 1, 12)12 - AIC:3143.8442376718695
ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:3353.257999073224
ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:3323.6219066063186
ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:3234.0077149491553
ARIMA(0, 1, 0)x(1, 1, 1, 12)12 - AIC:3136.503502360383
ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:3551.207700424642
ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:3300.9476629757373
ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:3456.578285729359
ARIMA(0, 1, 1)x(0, 1, 1, 12)12 - AIC:3133.7047637968067
ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:3350.149992788514
ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:3300.679819334945
ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:3226.9666095806615
ARIMA(0, 1, 1)x(1, 1, 1, 12)12 - AIC:3133.6196477548824
ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:3593.8624851531695
ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:3345.3078709335005
ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:3465.4269866606887
ARIMA(1, 0, 0)x(0, 1, 1, 12)12 - AIC:3163.7425044783536
ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:3354.9582276622814
ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:3345.0258926981796
ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:3204.296786129855
ARIMA(1, 0, 0)x(1, 1, 1, 12)12 - AIC:3146.08891205123
ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:3570.7284774801756
ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:3324.880765723923
ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:3422.3700194907433
ARIMA(1, 0, 1)x(0, 1, 1, 12)12 - AIC:3133.5510923922557
ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:3351.3715451220223
ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:3324.3872637093946
ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:3181.54808812721
ARIMA(1, 0, 1)x(1, 1, 1, 12)12 - AIC:3134.3925675996397
ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:3572.3485145117047
ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:3324.3482207183424
ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:3479.789189567126
ARIMA(1, 1, 0)x(0, 1, 1, 12)12 - AIC:3156.59801347067
ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:3333.0215337415225
ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:3323.9591637692993
ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:3208.451526057329
ARIMA(1, 1, 0)x(1, 1, 1, 12)12 - AIC:3120.083930425636
ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:3531.5757795053114
ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:3285.579465177882
ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:3457.3504781579186
ARIMA(1, 1, 1)x(0, 1, 1, 12)12 - AIC:3133.996382313838
ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:3314.7874737286434
ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:3285.759840313312
ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:3208.665557738407
ARIMA(1, 1, 1)x(1, 1, 1, 12)12 - AIC:3133.9470134524286

The above output shows that SARIMAX(1, 1, 0) x (1, 1, 1, 12) yields the lowest AIC value, 3120.08. Therefore we should consider this the optimal choice.
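Rather than scanning the printed grid by eye, the best combination can be tracked while looping. A sketch of the bookkeeping, with a toy scoring function standing in for the actual `sm.tsa.statespace.SARIMAX(...).fit().aic` call:

```python
import itertools

def toy_aic(order, seasonal_order):
    # hypothetical stand-in for sm.tsa.statespace.SARIMAX(...).fit().aic
    return sum(order) * 100 + sum(seasonal_order[:3]) * 10

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(a, b, c, 12) for a, b, c in itertools.product(p, d, q)]

# keep the smallest AIC seen so far together with its parameters
best_aic, best_order, best_seasonal = min(
    (toy_aic(o, s), o, s) for o in pdq for s in seasonal_pdq)
```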

In [88]:
mod = sm.tsa.statespace.SARIMAX(concat_uber,
                                order=(1, 1, 0),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

print(results.summary().tables[1])
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.1411      0.076      1.853      0.064      -0.008       0.290
ar.S.L12      -0.2034      0.086     -2.366      0.018      -0.372      -0.035
ma.S.L12      -0.9243      0.048    -19.147      0.000      -1.019      -0.830
sigma2      2.298e+07   1.44e-09    1.6e+16      0.000     2.3e+07     2.3e+07
==============================================================================
In [89]:
results.plot_diagnostics(figsize=(16, 8))
plt.show()
In [91]:
pred = results.get_prediction(start=pd.to_datetime('2014-07-01'), dynamic=False)
pred_ci = pred.conf_int()

ax = concat_uber['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))

ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)

ax.set_xlabel('Date')
ax.set_ylabel('Number of orders')
plt.legend()

plt.show()
In [92]:
y_forecasted = pred.predicted_mean
y_truth = concat_uber['2014-07-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The Mean Squared Error of our forecasts is 19330209.27
In [93]:
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
The Root Mean Squared Error of our forecasts is 4396.61
In [94]:
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()

ax = concat_uber.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Number of orders')

plt.legend()
plt.show()

Conclusion

In [ ]: